The n-tuple power law is widespread in language, computer program code, DNA and music. After
a large number of Zipf analyses of n-tuples in empirical data, we propose a model to
explain the n-tuple power law found in these information carriers. Our model
is a preferential selection approach inspired by Simon's model, which explains the scaling law of single
symbols in a Zipf analysis of a sequence. The kernel mechanism of our model is simple. It can be
described as a random copy and paste process, that is, repeatedly select a random segment
of the current sequence and attach it to the end. Simulations of our model show that
the n-tuple power law exists in model-generated data. Furthermore, two estimates derived from the model,
one for the Zipf exponent and one for the minimal n-tuple length at which the power law appears,
both correspond well to empirical data. Our model can also reproduce the symmetry breaking process
that leads to unequal numbers of A, T, G and C in DNA data.

PACS numbers: 89.75.Da, 89.75.Fb 



I. INTRODUCTION 

An intriguing feature of language is Zipf's law [?],
also known as the power law or Pareto's law [?]. In a Zipf
analysis, one calculates the frequency of each word in an
English text and sorts the frequencies in rank order,
from the largest to the smallest. If we plot these frequencies
against their rank in a log-log figure, the plot
shows a nearly straight line with slope $\xi \approx -1$
[1]. The relation between frequency and rank
can therefore be approximated by a power law, $k = r^{\xi}$,
where $k$ is the frequency, $r$ is the rank, and $\xi$, usually negative,
is referred to as the Zipf exponent. Some studies have used an
extended form $k = (r + c)^{-\alpha}$ [?, ?, 10]. The constant $c$,
however, does not have a substantial physical meaning.
Refs. [?] showed that, instead of a power law, many different
rank-ordered data can be fitted very well by the integrand
of a beta function. Zipf analysis has also been extended to
many other systems [?], such as city sizes [?], DNA base
pair sequences [11] and the distribution of firm sizes [?].

Many information carriers, such as language, program
code, and DNA, can be considered as symbol sequences.
English text can be regarded as a word sequence or a letter
sequence, where words are groups of letters
separated by spaces or punctuation. Chinese text can
be perceived as a sequence of Chinese characters, a computer
binary file as a binary sequence, and DNA as an ATGC
sequence. It is well known that statistics on the single symbols
of these sequences show no power law, except for English
text as a word sequence [?, 13]. For example, statistics on
the 26 letters in an English novel, on the Chinese characters in
a Chinese novel, or on the 4 symbols ATGC in a DNA
sequence show no power law.

*Electronic address: sliant@mail.bnu.edu.cn
†Electronic address: wangdh@bnu.edu.cn
‡Electronic address: zhan@bnu.edu.cn

Words are not always as easily separated as
they are by spaces in an English corpus. For
example, in Chinese, compound words composed of two or
more Chinese characters can be created if they are
semantically meaningful [14]. In other sequences, e.g.,
noncoding regions of DNA sequences, it is also not easy to
distinguish words [11]. The literature reports that statistics
on n-tuples in these symbol sequences show Zipf's
law [11, 13]. Let us use an example to demonstrate the
statistical method. Given the English letter sequence
"abbccc", of length 6, we get 5 2-tuples: "ab", "bb",
"bc", "cc", "cc". There are 4 unique 2-tuples here: "ab",
"bb", "bc", "cc", with frequencies 1, 1, 1, 2 respectively.
Formally, given a symbol sequence $S = (s_1, s_2, \cdots, s_t)$
of length $t$, we get $t - n + 1$ n-tuples by imagining
a window of width $n$ sliding from the beginning to
the end. We can perform frequency statistics on these
n-tuples. If the statistics show a power law,
we call it the n-tuple power law. The term n-tuple used
in [?] is also called n-gram in [?]. We
adopt the term n-tuple in this paper.
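The sliding-window count described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the statistics in this paper; the function name `ntuple_counts` is hypothetical, and the sequence is assumed to be a string (or any sequence whose slices are hashable).

```python
from collections import Counter

def ntuple_counts(sequence, n):
    """Count every n-tuple seen by a width-n window sliding over sequence."""
    # A sequence of length t yields t - n + 1 windows (none if t < n).
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))

counts = ntuple_counts("abbccc", 2)
print(sum(counts.values()))    # total number of 2-tuples
print(sorted(counts.items()))  # unique 2-tuples with their frequencies
```

For the example sequence this gives 5 windows and 4 unique 2-tuples, as in the text.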

Ref. [11] reported that n-tuples of DNA (noncoding
regions), treated as an ATGC base pair sequence, demonstrate a Zipf
feature. This feature also exists in n-tuples of English
text as a letter sequence, and of computer binary executable
files as a 0, 1 sequence. In that paper, n-tuple Zipf analyses
were performed on DNA with n = 3 through 8, on
English text with n = 3 through 5, and on a computer
binary executable file with n = 12. Ref. [11] also claimed that
noncoding sequences bear more resemblance to a natural
language than coding sequences do. Ref. [?] argued, however,
that Zipf analysis should be applied with caution when
detecting such a linguistic feature, since it cannot distinguish language
from power-law noise, e.g., 1/f noise.

Ref. [14] gives a detailed report that the n-tuple power law exists
in English text as a word or letter sequence, and in Chinese
text, from 1-tuple to 5-tuple.

However, our statistics show that the n-tuple power law
exists for much larger n and in ranges different from
those in [?]. For human DNA (note that we do not distinguish
coding and noncoding regions), the statistics
show a better power law when n > 10. For English text, a better power
law appears when n > 4. Ref. [?] also reported that the Zipf
exponent $\xi$ is almost the same for different n, whereas it is
found to be increasing in [?]. Our statistics also show
a Zipf exponent that increases with n. In Section III
we give an estimate for the range of
n, based on our model.

Refs. [?] proposed a Markov process to analyze
n-tuples, where the sequence is simplified to contain only
two different symbols, 0 and 1. The conditional probability
was calculated and gave results roughly similar to those
observed for long-range correlated sequences. Ref. [19] gives
the rank-frequency distribution of n-tuples based on the
assumption that the rank-frequency distribution of single
symbols follows Zipf's law.

Simon proposed a preferential selection model to explain
the power law distribution in the numerous examples
with this property known at that time [?]. However,
Simon's model cannot explain the n-tuple Zipf feature.
This paper follows Simon's idea and sets up a model
to explain the n-tuple Zipf feature.

In Section II we give our statistics on an English
corpus, a Chinese corpus, DNA, computer program
source code, and computer executable binary files. We also
show that Zipf's law does not exist
for a random sequence. In Section III we propose a model to explain
the n-tuple power law, and later give an estimate
for the Zipf exponent. We draw conclusions in
Section IV.



II. EMPIRICAL RESULTS

A. Zipf analysis of n-tuples

Ref. [21] mentioned that the short horizontal line segments
appearing at the bottom of a Zipf plot interfere with the
statistics, and proposed that the last one or two of these
line segments be discarded and each of the rest be represented
by its center point. Here we adopt
a similar method: for every line segment, we preserve
the right-most point and discard the rest. This is a much
easier way to eliminate the interference.

We make the traditional Zipf plot and then perform a
linear fit on the log-log plot. The slope is the Zipf exponent
(negative), and the square of the correlation coefficient,
$\rho^2 \in [0, 1]$, represents how good the fit is, with 1 meaning a perfect
straight line and 0 meaning no linear relation at all. We say there is a power
law if $\rho^2$ is close to 1 (typically when $\rho^2$ > 0.95).

We perform statistics on an English corpus, a Chinese corpus,
DNA, computer program source code, and
computer executable binary files. We also perform statistics
on DNA as a 01 sequence with AT=0 and GC=1, on music
pieces as note sequences, and on actor sequences in
drama. We show here only the statistics on the English corpus
and the DNA sequence. The rest of the statistical results are
presented in the supplementary material [22]. Because almost
all n-tuples appear only one or two times when n is too
large, we only perform statistics for n that is not too
large. Note that it is quite time consuming to perform
n-tuple statistics on large data sets, e.g. human DNA.
State-of-the-art techniques are needed. Some programming
techniques we used are presented in [22].

Fig. 1 and Fig. 2 are the Zipf analyses of the English corpus
of 15 novels by Dickens as a letter sequence and of the DNA ATGC
sequence of the human Y chromosome from [23]. For the Dickens
novels, the data already fit a power law well when n > 4.
For the Y chromosome, however, this happens only when n > 10.

FIG. 1: n-tuple Zipf analysis of 15 novels of Dickens as an English
letter sequence. (a) The frequency-rank statistics on n-tuples
for n = 2, 3, 4, 13, 26. (b) $\rho^2$ of linear fit against n. We can
see that n-tuple Zipf analyses show a power law for n > 4. (c)
Slope of linear fit against n. We can see that the slope tends
to zero.

FIG. 2: n-tuple Zipf analysis of the Y chromosome of a human
being as an ATGC sequence. Source: NCBI Human Genome
Resources [23]. (a) The frequency-rank statistics on n-tuples
for n = 4, 5, 6, 7, 11, 25, 40, 80. (b) $\rho^2$ of linear fit against n.
We can see that n-tuple Zipf analyses show a power law for
n > 10. (c) Slope of linear fit against n. We can see that the
slope tends to zero as in Fig. 1.

We can see from Fig. 1(c) and Fig. 2(c) that the slope
(the Zipf exponent) tends to zero as n increases. We
find that this is the case in all our statistics [22]. In
Section III we try to explain this feature based on our
model.
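The fitting procedure of this subsection (rank the frequencies, keep only the right-most point of each horizontal segment, then fit a line on the log-log plot) can be sketched as follows. This is a minimal illustration under our reading of the method, not the code used for the figures; the function name `zipf_fit` is hypothetical, and base-10 logarithms are an arbitrary choice that does not affect the slope or $\rho^2$.

```python
import math

def zipf_fit(frequencies):
    """Return (slope, rho2) of a least-squares line on the log-log Zipf plot."""
    ranked = sorted(frequencies, reverse=True)
    # Equal frequencies form one horizontal segment; keep its largest rank only.
    last_rank = {}
    for rank, freq in enumerate(ranked, start=1):
        last_rank[freq] = rank
    xs = [math.log10(r) for r in last_rank.values()]
    ys = [math.log10(f) for f in last_rank]
    if len(xs) < 2:
        raise ValueError("need at least two distinct frequencies")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    # Slope of the fitted line and squared correlation coefficient.
    return sxy / sxx, sxy * sxy / (sxx * syy)

slope, rho2 = zipf_fit([1000.0 / r for r in range(1, 51)])
print(slope, rho2)
```

The usage line feeds an exact power law with exponent -1, so the fitted slope should be -1 and $\rho^2$ should be 1 up to rounding error.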
B. No n-tuple power law in random sequence

It should be noted that the n-tuple power law does not
exist in a random sequence. We generate a random ATGC
sequence whose length and numbers of
A, T, G, and C are the same as those of the real Y chromosome
from the above source. In effect, such a sequence is a shuffle
of the original one. Fig. 3 is an n-tuple Zipf analysis of
such a shuffled sequence. We can see in Fig. 3(b) that
the n-tuple curves are not linear when n < 14, because
$\rho^2$ is low. Although the curve is linear and
$\rho^2$ is high (close to 1) when n ≥ 14, this is not evidence of a power
law. The reason is that, with our statistical
method, there are only a few points on the
curve when n ≥ 14: 8 points for n = 14, and 2 points
for n = 19, in which case $\rho^2$ of the linear fit is
necessarily exactly 1.




FIG. 3: n-tuple Zipf analysis of a random (shuffled) ATGC
sequence. (a) The frequency-rank statistics on n-tuples for
n = 4, 5, 6, 7, 11, 14. (b) $\rho^2$ of linear fit against n. We can
see that $\rho^2$ is close to 1 for n ≥ 14. This is, however, not
evidence of a power law. The reason is that almost all
n-tuples appear only a few times when n ≥ 14. For example,
all n-tuples appear only one or two times for n = 19, so there
are only two points on the curve. In this case, $\rho^2$ of the linear
fit is necessarily exactly 1 under our statistical method. As
mentioned in Section II A, these cases should be discarded. For the
remaining part, n < 14, there is no power law because $\rho^2$ is far too
low. Comparing (b) with Fig. 2(b), the difference is clear: in
Fig. 2(b), $\rho^2$ is close to 1 over the whole range from n = 10 to 80,
which indicates a power law.

We can explain why the n-tuple power law
does not exist in a random sequence. Given a random
ATGC sequence, suppose each element in the sequence is
an independent and identically distributed random variable,
and the probabilities of A, T, G and C are $P_A$, $P_T$, $P_G$, $P_C$
respectively. Then the probability that the elements at
two given loci are the same is $P = P_A^2 + P_T^2 + P_G^2 + P_C^2$,
with $0 < P < 1$. Two n-tuples are the same when the elements
at every corresponding locus are the same, so the probability
is $P^n$. $P^n$ tends to 0 exponentially as n increases.
That is why almost all n-tuples appear only a few times
when n ≥ 14 in Fig. 3. When n is relatively small, the
probability of each unique n-tuple can easily be calculated
from $P_A$, $P_T$, $P_G$, $P_C$, and it does not follow a
power law.
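To make the exponential decay concrete, the following sketch evaluates the probability that two independent n-tuples coincide. The function name `match_probability` and the uniform base composition are assumptions made only for illustration; the real Y chromosome composition is not uniform.

```python
def match_probability(symbol_probs, n):
    """Probability that two independent n-tuples agree at all n loci."""
    # P is the sum of the squared symbol probabilities.
    p_same = sum(p * p for p in symbol_probs)
    return p_same ** n

uniform = [0.25] * 4  # assumed equal proportions of A, T, G, C
for n in (4, 7, 14, 19):
    print(n, match_probability(uniform, n))
```

With equal proportions P = 1/4, so each additional symbol in the tuple divides the coincidence probability by 4.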

III. MODELING N-TUPLE

A. Simon's model

Simon's model is a preferential selection model [20]. It
can be simply described as: repeatedly select a random element
from the current sequence and attach it to the end of the
sequence. A formal description is as follows:
at each step a new element is attached to the end of
the current symbol sequence. The newly attached element
follows two rules:

Rule 1 (new unique symbol rule). There is a constant
probability $\alpha$ that the newly attached element
is a new unique symbol that has never appeared.

Rule 2 (preferential selection rule). Otherwise, the newly
attached element is randomly selected from the current
sequence.

From these two rules, Simon proved that a power law
will appear and that the slope (Zipf exponent) is $\alpha - 1$.
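The two rules can be sketched as a short simulation. This is an illustrative sketch, not Simon's or the authors' code: the function name `simon_sequence`, the use of integers as symbols, and the single-symbol starting sequence are assumptions.

```python
import random

def simon_sequence(alpha, length, seed=0):
    """Grow a sequence by Simon's two rules; symbols are integers 0, 1, 2, ..."""
    rng = random.Random(seed)
    seq = [0]        # assumed start: a single symbol
    next_symbol = 1
    while len(seq) < length:
        if rng.random() < alpha:
            seq.append(next_symbol)      # Rule 1: a symbol never seen before
            next_symbol += 1
        else:
            seq.append(rng.choice(seq))  # Rule 2: copy a random existing element
    return seq

sample = simon_sequence(0.1, 1000)
print(len(sample), len(set(sample)))
```

Because Rule 2 picks an element of the sequence rather than a symbol of the alphabet, frequent symbols are copied more often, which is the preferential selection.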

Note that although Simon's model is still valid when
$\alpha$ is very close to 0 or 1, it is not easy to observe a power
law in this circumstance, because the sequence
needs to be very long to exhibit any meaningful
feature.

We have performed n-tuple Zipf analysis on sequences
generated by Simon's model, and found that the n-tuple power
law does not exist. The plot is similar to Fig. 3. So we
need a new model that can accommodate the n-tuple power
law.



B. Model description and simulation results 

Our model is a consecutive subsequence preferential
selection model inspired by Simon's model. It can be
simply described as a random copy and paste process:
repeatedly select a random consecutive subsequence from
the current sequence and attach it to the end. A
formal description is as follows:

Step 0. Given 4 parameters: the length $T_{\min}$ of the initial
        symbol sequence, the number $U$ of unique symbols,
        the discrete distribution $D$ which generates
        random positive integers, and the maximum
        length $T_{\max}$ of the symbol sequence.

Step 1. Generate an initial symbol sequence of length $T_{\min}$, in which
        each element is randomly selected from the $U$ unique
        symbols.

Step 2. Suppose the current sequence is $C$, and its
        length is $t$. Generate a random integer $a$, which
        has a uniform distribution in $[1, t]$, as the start
        point of the subsequence. Generate a random length
        $b$, which has distribution $D$, as the length of the
        subsequence. If $a + b \le t + 1$, go to Step 3; else,
        repeat this step. (This makes sure the randomly
        selected subsequence lies inside $C$.)

Step 3. Let $R$ be the randomly selected subsequence
        of $C$, starting from $a$ with length $b$. Copy $R$,
        attach it to the end of $C$, and take the result as the
        new $C$. Update the length of $C$: $t = t + b$.

Step 4. If $t > T_{\max}$, stop; else, go to Step 2.
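Steps 0 to 4 can be sketched in Python as below. This is a minimal illustration, not the program used for the simulations in this paper. The names `copy_paste_sequence`, `draw_length` and `exponential_length` are hypothetical, symbols are represented by the integers 0 to U-1, and rejecting a drawn length of 0 is our assumption, made because D is required to produce positive integers.

```python
import random

def copy_paste_sequence(t_min, num_symbols, draw_length, t_max, seed=0):
    """Grow a symbol sequence by repeatedly copying a random segment to its end."""
    rng = random.Random(seed)
    # Step 1: random initial sequence over the symbols 0 .. num_symbols - 1.
    seq = [rng.randrange(num_symbols) for _ in range(t_min)]
    # Step 4 is the loop condition: stop once the length exceeds t_max.
    while len(seq) <= t_max:
        t = len(seq)
        # Step 2: start a uniform in [1, t], length b from D; retry if it does not fit.
        a = rng.randint(1, t)
        b = draw_length(rng)
        if b < 1 or a + b > t + 1:
            continue
        # Step 3: copy the segment at 1-based positions a .. a+b-1 and append it.
        seq.extend(seq[a - 1:a - 1 + b])
    return seq

def exponential_length(rng, rate=0.02):
    """Integer part of an exponential draw with the given rate (may be 0)."""
    return int(rng.expovariate(rate))

generated = copy_paste_sequence(100, 4, exponential_length, 5000)
print(len(generated))
```

The usage line stops at a small maximum length so that it runs quickly; reproducing a chromosome-sized sequence would need the much larger value of the maximum length quoted for the simulations.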

Fig. 4 is an n-tuple Zipf analysis of a sequence generated
by this model. The parameters of the model are tuned to the real
DNA data of Fig. 2. We can see that the n-tuple power law
does exist in our model, and that it replicates well the DNA
Zipf analysis of Fig. 2.

FIG. 4: n-tuple Zipf analysis of data generated by our model.
Two of the parameters of our model are set according
to the real human Y chromosome of Fig. 2: U = 4,
$T_{\max}$ = 25652954. $T_{\min}$ = 100, and D is an exponential
distribution with PDF (probability density function)
$0.02 e^{-0.02x}$ (we use the integer part of the generated random
numbers). (a) The frequency-rank statistics on n-tuples for
n = 4, 5, 6, 7, 11, 25, 40, 80. (b) $\rho^2$ of linear fit against n. We
can see that n-tuple Zipf analyses show a power law for n > 10.
(c) Slope of linear fit against n. We can see that the slope
tends to zero. These are the same as in the real DNA data shown
in Fig. 2.



C. Model Analysis 

Let us begin with an example. Suppose the current
sequence is $C = (s_1, s_2, \cdots, s_8)$, of length 8. There are
6 3-tuples: $A_1 = (s_1, s_2, s_3)$, $A_2 = (s_2, s_3, s_4)$, $\cdots$,
$A_6 = (s_6, s_7, s_8)$. Randomly select a subsequence from
$C$, say, starting at 3 with length 4, that is $R =
(s_3, s_4, s_5, s_6)$. Copy $R$ and attach it to the end of $C$;
now $C = (s_1, s_2, \cdots, s_8, s_3, s_4, s_5, s_6)$, of length 12. There
are 10 3-tuples now: $A_1, A_2, \cdots, A_6$ are unchanged, and
$A_7 = (s_7, s_8, s_3)$, $A_8 = (s_8, s_3, s_4)$, $A_9 = (s_3, s_4, s_5)$,
$A_{10} = (s_4, s_5, s_6)$ are newly formed. We can see that
$A_9 = A_3$ and $A_{10} = A_4$. As for $A_7$ and $A_8$, it is unknown whether
either equals any of $A_1, A_2, \cdots, A_6$. If not, a new
unique 3-tuple is introduced. Fig. 5 shows this whole
process.

FIG. 5: A demonstration of our model. $R = (s_3, s_4, s_5, s_6)$ is
randomly selected, copied and attached to the end.

Now we can give a formal description. Suppose
the current sequence is $C = (s_1, s_2, \cdots, s_t)$, of length
$t$. Considering n-tuples, there are $t - n + 1$ of them:
$A_1 = (s_1, s_2, \cdots, s_n)$, $A_2 = (s_2, s_3, \cdots, s_{n+1})$, $\cdots$,
$A_{t-n+1} = (s_{t-n+1}, s_{t-n+2}, \cdots, s_t)$. A subsequence
$R = (s_a, s_{a+1}, \cdots, s_{a+b-1})$ is randomly selected
from $C$, starting at $a$ with length $b$, and is
copied and attached to the end of $C$. Now $C =
(s_1, s_2, \cdots, s_t, s_a, s_{a+1}, \cdots, s_{a+b-1})$, of length $t + b$. There
are $t - n + 1 + b$ n-tuples now: $A_1, A_2, \cdots, A_{t-n+1}$ are
unchanged, and $A_{t-n+2}, A_{t-n+3}, \cdots, A_{t-n+1+b}$ are newly formed.
The number of newly formed n-tuples is $b$. They
can be divided into two cases, Case IN and Case
OUT, as Fig. 5 shows.

Case IN. If $b \ge n$, among the $b$ newly formed n-tuples,
the last $b - n + 1$ fall inside
$R$, and are equal to $A_a, A_{a+1}, \cdots, A_{a+b-n}$ respectively.
Let $P_{in}$ be the probability that a newly formed
n-tuple belongs to this case.

Case OUT. If $b \ge n$, among the $b$ newly formed n-tuples,
the first $n - 1$ fall (partly) outside
of $R$. If $b < n$, all $b$ newly formed n-tuples
fall (partly) outside of $R$. It is unknown
whether these newly formed n-tuples are equal to any
of $A_1, A_2, \cdots, A_{t-n+1}$. If one is not, a new
unique n-tuple is introduced.
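The split of the b newly formed n-tuples into the two cases can be written as a small helper. The function name `split_new_ntuples` is hypothetical, and the sketch assumes the current sequence is long enough (t at least n - 1) for all b new windows to exist.

```python
def split_new_ntuples(b, n):
    """Return (inside, outside): how many of the b new n-tuples fall in each case."""
    inside = max(b - n + 1, 0)  # Case IN: windows lying entirely within R
    outside = b - inside        # Case OUT: windows overlapping the old end of C
    return inside, outside

print(split_new_ntuples(4, 3))  # the worked 3-tuple example with a length-4 copy
```

For the worked example (b = 4, n = 3) two new 3-tuples fall in Case IN and two in Case OUT; when b is smaller than n, all b new n-tuples are in Case OUT.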

Unfortunately, we are unable to give a strict analysis
of Case OUT. Therefore we make the following assumption.

Assumption 1. The probability that an element in
Case OUT is a duplicate is very small and
can be neglected when n is large.

This assumption is necessary for the following discussion.
One may doubt whether the assumption is reasonable.
The most convincing evidence is that
our theoretical equation based on this assumption fits the
model well; see Section III D and Section III E. The fact
that the number of all possible n-tuples increases
exponentially with n also favors this assumption; see
Section III E. We hope that future work can give a strict
analysis of this assumption.

Starting from this assumption, if we perceive
$A_1, A_2, \cdots$ in our model as the symbols in Simon's model,
Case IN is equivalent to the preferential selection rule in
Simon's model (Rule 2), and Case OUT corresponds
to the new unique symbol rule in Simon's model (Rule
1). Now we can utilize the proof of Simon's model to
prove the existence of the n-tuple power law in our model.
We can also calculate the Zipf exponent. In the next section,
we show that the calculated Zipf exponent corresponds
well to the model. This demonstrates the validity of the
assumption.

FIG. 6: Comparison of Eq. (1) with the model simulation result of
Fig. 4(c). The parameters are the same as in Fig. 4. We can
see that Eq. (1) fits the model well when n is large (n > 10),
which indicates the validity of Assumption 1.



D. Slope (Zipf exponent) 

Now we give an estimate for the slope (Zipf exponent)
of our model.

Consider $P_{in}$. Let $f$ be the PDF of the distribution
$D$. According to Step 2 in Section III B, we need
to repeatedly generate the random integer $b$ from the distribution
$D$ until $a + b \le t + 1$. However, we suppose that the
generated random integer $b$ always satisfies $a + b \le t + 1$.
This is reasonable because it is almost always the case for
any PDF when $t$ is large. Furthermore, we assume that the
expectation of the distribution $D$ exists.

Now, according to Case IN, we have

$$P_{in} = \frac{\sum_{x=n}^{+\infty} (x - n + 1) f(x)}{\sum_{x=1}^{+\infty} x f(x)},$$

or in the integral form

$$P_{in} = \frac{\int_n^{+\infty} (x - n + 1) f(x)\,dx}{\int_0^{+\infty} x f(x)\,dx}.$$

We use $P_{in}$ to estimate the slope (Zipf exponent). According
to Assumption 1 and Simon's model, $P_{in} = 1 - \alpha$.
So we have

$$\text{slope} = \alpha - 1 = -\frac{\int_n^{+\infty} (x - n + 1) f(x)\,dx}{\int_0^{+\infty} x f(x)\,dx}. \qquad (1)$$

Eq. (1) is obtained with no requirement on the detailed form
of the distribution function. The derivation only requires
the existence of the expectation. Fig. 6 is a comparison
of Eq. (1) and the actual simulation result of our model. We
can see that when n is large (n > 10), Eq. (1) gives the
same result as our model. This demonstrates the
validity of Assumption 1.
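The slope estimate, which is minus the expected number of Case IN n-tuples per copy divided by the expected copy length, can be evaluated numerically for a discrete distribution. The sketch below is an illustration under stated assumptions: the names `slope_estimate` and `geometric_pmf` are hypothetical, the sums are truncated at `x_max`, and D is taken to be the integer part of an exponential draw with rate 0.02 conditioned on being at least 1.

```python
import math

def slope_estimate(n, pmf, x_max):
    """-(sum_{x>=n} (x-n+1) f(x)) / (sum_{x>=1} x f(x)), truncated at x_max."""
    numerator = sum((x - n + 1) * pmf(x) for x in range(n, x_max + 1))
    mean = sum(x * pmf(x) for x in range(1, x_max + 1))
    return -numerator / mean

def geometric_pmf(x, rate=0.02):
    """PMF of int(Exp(rate)) conditioned on x >= 1 (an assumed form of D)."""
    return (math.exp(rate) - 1.0) * math.exp(-rate * x)

for n in (1, 11, 40, 80):
    print(n, slope_estimate(n, geometric_pmf, 5000))
```

For this particular distribution the sums can be done by hand and give a slope of $-e^{-0.02(n-1)}$, so the printed values should rise monotonically toward zero as n grows.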

We now compute the limit of Eq. (1). Because
$f$ is the PDF of the distribution $D$ which generates
random positive integers, $\int_0^{+\infty} f(x)\,dx$ is
absolutely convergent, so $\lim_{n\to+\infty} \int_n^{+\infty} f(x)\,dx = 0$.
According to the assumption that the expectation
of $D$ exists, $\int_0^{+\infty} x f(x)\,dx$ is
absolutely convergent, so $\lim_{n\to+\infty} \int_n^{+\infty} x f(x)\,dx = 0$.
Note that $0 \le \int_n^{+\infty} n f(x)\,dx \le \int_n^{+\infty} x f(x)\,dx$, so
$\lim_{n\to+\infty} \int_n^{+\infty} n f(x)\,dx = 0$. From these, we can split
the numerator of Eq. (1) into 3 parts and have

$$\lim_{n\to+\infty} \text{slope} = -\lim_{n\to+\infty} \frac{\int_n^{+\infty} x f(x)\,dx - \int_n^{+\infty} n f(x)\,dx + \int_n^{+\infty} f(x)\,dx}{\int_0^{+\infty} x f(x)\,dx} = 0.$$

Secondly, consider the derivative of Eq. (1) with respect to n. Noticing that
the numerator is an integral depending on a parameter, we
have

$$\frac{d}{dn}\left(\int_n^{+\infty} (x - n + 1) f(x)\,dx\right) = -\int_n^{+\infty} f(x)\,dx - f(n) \le 0,$$

hence

$$\frac{d}{dn}(\text{slope}) = \frac{\int_n^{+\infty} f(x)\,dx + f(n)}{\int_0^{+\infty} x f(x)\,dx} \ge 0.$$

The fact that the limit of the slope is zero, and that its
derivative is greater than or equal to zero, explains why all the
slopes in our statistics and simulations are monotonically
increasing and tend to zero when n is large. Notice that
when n is small, the slopes in our statistics and simulations
are irregular. This is because Assumption 1 is not
satisfied.
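As a worked instance of these two properties, assume (as an approximation that ignores the integer rounding of the copy length) that D is a continuous exponential distribution with density $f(x) = \lambda e^{-\lambda x}$. The symbol $\lambda$ is introduced here only for this illustration.

```latex
\begin{align*}
\int_n^{+\infty} (x - n + 1)\,\lambda e^{-\lambda x}\,dx
  &= e^{-\lambda n}\left(\frac{1}{\lambda} + 1\right),
&
\int_0^{+\infty} x\,\lambda e^{-\lambda x}\,dx &= \frac{1}{\lambda}, \\
\text{slope}(n) &= -(1 + \lambda)\,e^{-\lambda n},
&
\frac{d}{dn}\,\text{slope}(n) &= \lambda (1 + \lambda)\,e^{-\lambda n} > 0 .
\end{align*}
```

The slope is therefore negative, increases monotonically, and tends to zero as n grows. For $\lambda = 0.02$ this gives slope(10) ≈ -0.835 and slope(80) ≈ -0.206 under the continuous approximation.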



E. The threshold of n 

We found from the statistics and the simulation of our
model that the n-tuple power law does not exist when n is
small. We need an estimate of the threshold of n
above which the n-tuple power law appears.

Following the notation of the above sections, there are
only $U^n$ possible unique n-tuples. Because $\alpha$ is the
probability that a newly generated n-tuple is a new one
that has never appeared before, the expected number of
unique n-tuples in a sequence $C$ of length $t$ is $\alpha t$. So
we have $\alpha t < U^n$, i.e.

$$n > \log_U t + \log_U \alpha. \qquad (2)$$

We can easily infer the following.

For any given n, there is a sufficiently large t such
that Eq. (2) fails to hold. Intuitively, for a given U and a
given n, when the sequence length increases, the probability
that a newly generated n-tuple did not occur before
decreases, instead of being approximately constant.
This is further addressed in the conclusion
section.

For any given t, there is a sufficiently small n such
that Eq. (2) fails to hold; in other words, n needs to be
sufficiently large for Eq. (2) to hold.

Therefore, we use Eq. (2) to estimate the threshold of
n. In all the data we analyze in this paper, t is very large
and $\alpha$ is not close to zero, so we have the simpler estimate

$$n > \log_U t. \qquad (3)$$

Let us compare Eq. (3) with the actual statistical results.
In Fig. 1 we perform statistics on 15 novels of Dickens
as an English letter sequence. There are 17211736 letters
in total and 26 possible unique letters (a-z), so
$\log_{26} 17211736 = 5.113753$. We can see in Fig. 1 that
the n-tuple power law exists for about n > 4. In Fig. 2 the
DNA sequence length is 25652954, with 4 possible unique
symbols (ATGC), so $\log_4 25652954 = 12.306311$, and we
can see that the n-tuple power law exists for about n > 10.
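The two logarithms can be checked with a few lines of Python. The function name `threshold_n` is hypothetical; passing `alpha` adds the extra term of the full estimate, while leaving it out gives the simplified one.

```python
import math

def threshold_n(t, num_symbols, alpha=None):
    """log_U(t), plus log_U(alpha) when a new-tuple probability alpha is given."""
    bound = math.log(t, num_symbols)
    if alpha is not None:
        bound += math.log(alpha, num_symbols)
    return bound

print(round(threshold_n(17211736, 26), 6))  # Dickens novels as letters
print(round(threshold_n(25652954, 4), 6))   # human Y chromosome
```

Since alpha is at most 1, its logarithm is never positive, so the simplified estimate is the more conservative of the two.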

This is an interesting result. It reveals the relation between
the power law and diversity. As mentioned above, in
order to show a power law, the number of unique elements
in a sequence should be neither too small nor too large, i.e. a
proper diversity should be maintained.

A subtle problem deserves some attention here.
If Simon's model were a sufficient and necessary condition
for a power law curve, then violating the above inequality
would mean violating Simon's model, and hence there would be no power
law. Unfortunately, Simon's model is only a sufficient
condition for a power law, not a necessary one. This means that
this section is not a strict theoretical deduction. We hope
future work can improve on this.

F. Model parameter settings 

There are 4 parameters in our model, as mentioned
in Section III B: the initial length $T_{\min}$, the number $U$
of unique symbols, the discrete distribution $D$, and the
maximum length $T_{\max}$.

It is obvious that our model requires $1 < T_{\min} \ll T_{\max}$
and $1 < U \ll T_{\max}$. We did some sensitivity analyses
and find that our model is not sensitive to $T_{\min}$, $U$
and $T_{\max}$. While the other parameters remain the same, we
test different values for these 3 parameters, for example
$T_{\min} = 100$, $T_{\min} = 1000$, $U = 4$, $U = 100$, $T_{\max} = 10^6$,
$T_{\max} = 10^7$, etc., and find that the n-tuple power law always
exists in the simulations; only the threshold of
n varies a little, in accordance with the discussion of the above
section.

However, there are certain requirements for the distribution
$D$. As mentioned in Section III A, in order
to observe a power law distribution, the probability $\alpha$
that a new unique element is introduced should not be
very close to 0 or 1. Since $\alpha = 1 - P_{in}$, Eq. (1) shows that
keeping $\alpha$ away from 1 requires that
$\int_n^{+\infty} (x - n + 1) f(x)\,dx$ is not very close to 0 when n is
large, i.e. that $P(X > n)$ is not very close to 0 when n is large.
Such a distribution can be a distribution with
a large expectation (e.g. an exponential distribution
with expectation 50 as in Fig. 4), a distribution that has
a fat tail, or even a degenerate distribution that has only
one large value.

G. The symmetry breaking process 

The numbers of A, T, G and C in a DNA sequence are not the
same. We calculated the entropy of human DNA ATGC
data [?]. The entropy is defined as $-\sum_{i=1}^{4} p_i \log_2 p_i$, where
$p_i$ is the proportion of each of A, T, G and C in the sequence, i = 1, 2, 3, 4.
The entropy ranges from 1.959566 for chromosome 4 to
1.999227 for chromosome 19, with an average of 1.97733
and a standard deviation of 0.01084.
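The entropy of the base composition can be computed as follows. This is an illustrative sketch; the function name `base_entropy` is hypothetical and the input is assumed to be a string (or list) of symbols.

```python
import math
from collections import Counter

def base_entropy(sequence):
    """Shannon entropy, in bits, of the symbol proportions in sequence."""
    counts = Counter(sequence)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

print(base_entropy("ATGC" * 10))      # equal proportions of the four bases
print(base_entropy("AAAATTTGGC"))     # unequal proportions
```

With four symbols the entropy is at most 2 bits, reached only when all four proportions are equal, so values slightly below 2 measure how far the composition departs from symmetry.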

This stylized fact is reproduced by our model. ATGC
in our model initially follows a uniform distribution. We
calculate the entropy for 1000 simulation runs of our
model, with the same parameters as in Fig. 2. We
find that after the growth process, the entropy of our
model ranges from 1.74984 to 1.99991, with an average of
1.98894 and a standard deviation of 0.01652. The symmetry
breaking is due to the selections made at the early
steps.

IV. CONCLUSION 

We have performed many n-tuple Zipf analyses, up to very large
n, on a wide variety of real data covering an English
corpus, a Chinese corpus, computer program source
code and binary files, music, and ATGC sequences from
human DNA (see supplementary material [13]). These analyses
show the trend that when n increases to a certain
value, a power law appears and the slope
tends to zero. We also showed that there is no n-tuple
power law for random data by reshuffling the DNA data.

We interpret this n-tuple scaling law in a variety
of information carriers as follows: a meaningful "motif"
in the information carrier of each field needs a
certain length of symbols to express a relatively complete
"sentence", hence a motif has a specific characteristic
sequence length.
Instead of modeling this n-tuple power law feature with
a 1-bit Markov process [?, ?, 16], we set up a model
to explain the n-tuple power law feature. The model
is a simple copy and paste process, inspired by Simon's
model. The model is tuned to DNA data and it shows
its effectiveness in reproducing the n-tuple power law feature.
We hope this model can help to figure out how
language and DNA are generated.

Based on Assumption 1, that almost all n-tuples in Case OUT are
new unique n-tuples, we also calculated the Zipf exponent
and proved that it tends to zero, the same trend as
the real data show. However, we hope future work can give
a strict analysis of Case OUT and Assumption 1. Moreover,
when the length of the sequence increases with our
growth mechanism, the probability that an n-tuple in Case OUT is a
new element decreases. Refs. [24] and [?] discussed Simon's model
under this circumstance.

This model also reproduces the symmetry breaking
process behind the unequal numbers of A, T, G and C in DNA sequences, with an
average entropy value quite close to the real one.

We should point out here that real data have some aspects
that this model does not address well. Most of our
analyses are based on DNA data. We do not calibrate
this model to other data sources, considering
that the analysis based on DNA data already gives
the main results of this model. Calibrating the model to
other data sources, which is quite time consuming, may
not show anything new. There are other features in
empirical data. For example, in an English corpus as a letter
sequence and a modern Chinese corpus as a Chinese character
sequence, the arrival of a given symbol is a Poisson
process, while this is not the case for an English corpus as
a word sequence, an ancient Chinese corpus, or a DNA
sequence. Sequences generated by this model, however, are
always Poisson processes (see supplementary material [22]).
We also find that the long range correlation found
in noncoding regions of real DNA [?] does not exist in
sequences generated by this model.